Breast cancer is one of the most common types of cancer among women worldwide. Early detection and accurate classification of breast tumors into benign (non-cancerous) and malignant (cancerous) categories are crucial for effective treatment and improved patient outcomes. This classification task involves analyzing various characteristics of the tumor, derived from medical imaging and other diagnostic methods, to determine its nature.
The objective of this project is to develop a machine learning model that accurately classifies breast tumors as benign or malignant based on a set of diagnostic features extracted from medical images. The primary goal is to improve the accuracy and reliability of the classification to assist healthcare professionals in making informed decisions about patient care.
The dataset consists of an id column, a diagnosis label, and several features that describe the characteristics of breast tumors, including:
The target variable is:
Given a set of diagnostic features describing breast tumors, build a predictive model that can classify the tumors into benign or malignant categories with high accuracy. The model should be evaluated based on its ability to correctly identify malignant tumors (high recall) while maintaining overall accuracy and precision.
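The recall-oriented evaluation described above can be illustrated with scikit-learn's metric functions. The labels below are a hypothetical toy example (1 = malignant, 0 = benign), not the project data:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical predictions: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]

# Recall: of the actual malignant tumors, how many did the model catch?
print("Recall:   ", recall_score(y_true, y_pred))     # 2 of 3 malignant found
# Precision: of the predicted malignant tumors, how many are truly malignant?
print("Precision:", precision_score(y_true, y_pred))  # 2 of 2 predictions correct
print("Accuracy: ", accuracy_score(y_true, y_pred))   # 5 of 6 labels correct
```

The missed malignant case (the third patient) lowers recall but not precision, which is why recall is the metric to prioritize in this setting.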
!pip install tensorflow==2.15.0 scikit-learn==1.2.2 matplotlib==3.7.1 seaborn==0.13.1 numpy==1.25.2 pandas==1.5.3 -q --user
Note: After running the above cell, kindly restart the notebook kernel and then run all subsequent cells sequentially.
# Libraries for data manipulation and analysis
import pandas as pd
import numpy as np
# Libraries for data splitting and preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Libraries for evaluating model performance
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
classification_report,
roc_curve,
precision_recall_curve,
auc
)
# Time related functions.
import time
# Libraries for handling imbalanced data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Libraries for model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# Libraries for deep learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input, Dropout, BatchNormalization
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import backend
# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Setting display options for Pandas
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# Suppressing warnings
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
#Reading the dataset.
data = pd.read_csv('/content/drive/MyDrive/ benign and malign cancer classification/fufi-cancer/data.csv')
# Let's view the first 5 rows of the data
data.head()
| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.990 | 10.380 | 122.800 | 1001.000 | 0.118 | 0.278 | 0.300 | 0.147 | 0.242 | 0.079 | 1.095 | 0.905 | 8.589 | 153.400 | 0.006 | 0.049 | 0.054 | 0.016 | 0.030 | 0.006 | 25.380 | 17.330 | 184.600 | 2019.000 | 0.162 | 0.666 | 0.712 | 0.265 | 0.460 | 0.119 | NaN |
| 1 | 842517 | M | 20.570 | 17.770 | 132.900 | 1326.000 | 0.085 | 0.079 | 0.087 | 0.070 | 0.181 | 0.057 | 0.543 | 0.734 | 3.398 | 74.080 | 0.005 | 0.013 | 0.019 | 0.013 | 0.014 | 0.004 | 24.990 | 23.410 | 158.800 | 1956.000 | 0.124 | 0.187 | 0.242 | 0.186 | 0.275 | 0.089 | NaN |
| 2 | 84300903 | M | 19.690 | 21.250 | 130.000 | 1203.000 | 0.110 | 0.160 | 0.197 | 0.128 | 0.207 | 0.060 | 0.746 | 0.787 | 4.585 | 94.030 | 0.006 | 0.040 | 0.038 | 0.021 | 0.022 | 0.005 | 23.570 | 25.530 | 152.500 | 1709.000 | 0.144 | 0.424 | 0.450 | 0.243 | 0.361 | 0.088 | NaN |
| 3 | 84348301 | M | 11.420 | 20.380 | 77.580 | 386.100 | 0.142 | 0.284 | 0.241 | 0.105 | 0.260 | 0.097 | 0.496 | 1.156 | 3.445 | 27.230 | 0.009 | 0.075 | 0.057 | 0.019 | 0.060 | 0.009 | 14.910 | 26.500 | 98.870 | 567.700 | 0.210 | 0.866 | 0.687 | 0.258 | 0.664 | 0.173 | NaN |
| 4 | 84358402 | M | 20.290 | 14.340 | 135.100 | 1297.000 | 0.100 | 0.133 | 0.198 | 0.104 | 0.181 | 0.059 | 0.757 | 0.781 | 5.438 | 94.440 | 0.011 | 0.025 | 0.057 | 0.019 | 0.018 | 0.005 | 22.540 | 16.670 | 152.200 | 1575.000 | 0.137 | 0.205 | 0.400 | 0.163 | 0.236 | 0.077 | NaN |
# Removing the last column as it is empty
data = data.iloc[:,:-1]
data.head()
| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.990 | 10.380 | 122.800 | 1001.000 | 0.118 | 0.278 | 0.300 | 0.147 | 0.242 | 0.079 | 1.095 | 0.905 | 8.589 | 153.400 | 0.006 | 0.049 | 0.054 | 0.016 | 0.030 | 0.006 | 25.380 | 17.330 | 184.600 | 2019.000 | 0.162 | 0.666 | 0.712 | 0.265 | 0.460 | 0.119 |
| 1 | 842517 | M | 20.570 | 17.770 | 132.900 | 1326.000 | 0.085 | 0.079 | 0.087 | 0.070 | 0.181 | 0.057 | 0.543 | 0.734 | 3.398 | 74.080 | 0.005 | 0.013 | 0.019 | 0.013 | 0.014 | 0.004 | 24.990 | 23.410 | 158.800 | 1956.000 | 0.124 | 0.187 | 0.242 | 0.186 | 0.275 | 0.089 |
| 2 | 84300903 | M | 19.690 | 21.250 | 130.000 | 1203.000 | 0.110 | 0.160 | 0.197 | 0.128 | 0.207 | 0.060 | 0.746 | 0.787 | 4.585 | 94.030 | 0.006 | 0.040 | 0.038 | 0.021 | 0.022 | 0.005 | 23.570 | 25.530 | 152.500 | 1709.000 | 0.144 | 0.424 | 0.450 | 0.243 | 0.361 | 0.088 |
| 3 | 84348301 | M | 11.420 | 20.380 | 77.580 | 386.100 | 0.142 | 0.284 | 0.241 | 0.105 | 0.260 | 0.097 | 0.496 | 1.156 | 3.445 | 27.230 | 0.009 | 0.075 | 0.057 | 0.019 | 0.060 | 0.009 | 14.910 | 26.500 | 98.870 | 567.700 | 0.210 | 0.866 | 0.687 | 0.258 | 0.664 | 0.173 |
| 4 | 84358402 | M | 20.290 | 14.340 | 135.100 | 1297.000 | 0.100 | 0.133 | 0.198 | 0.104 | 0.181 | 0.059 | 0.757 | 0.781 | 5.438 | 94.440 | 0.011 | 0.025 | 0.057 | 0.019 | 0.018 | 0.005 | 22.540 | 16.670 | 152.200 | 1575.000 | 0.137 | 0.205 | 0.400 | 0.163 | 0.236 | 0.077 |
# Let's view the last 5 rows of the data
data.tail()
| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 564 | 926424 | M | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | ... | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
| 565 | 926682 | M | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | ... | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
| 566 | 926954 | M | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | ... | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
| 567 | 927241 | M | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | ... | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |
| 568 | 92751 | B | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | ... | 9.456 | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 |
5 rows × 32 columns
# Checking the number of rows and columns in the data
data.shape
(569, 32)
# Let's check the datatypes of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 569 entries, 0 to 568 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 569 non-null int64 1 diagnosis 569 non-null object 2 radius_mean 569 non-null float64 3 texture_mean 569 non-null float64 4 perimeter_mean 569 non-null float64 5 area_mean 569 non-null float64 6 smoothness_mean 569 non-null float64 7 compactness_mean 569 non-null float64 8 concavity_mean 569 non-null float64 9 concave points_mean 569 non-null float64 10 symmetry_mean 569 non-null float64 11 fractal_dimension_mean 569 non-null float64 12 radius_se 569 non-null float64 13 texture_se 569 non-null float64 14 perimeter_se 569 non-null float64 15 area_se 569 non-null float64 16 smoothness_se 569 non-null float64 17 compactness_se 569 non-null float64 18 concavity_se 569 non-null float64 19 concave points_se 569 non-null float64 20 symmetry_se 569 non-null float64 21 fractal_dimension_se 569 non-null float64 22 radius_worst 569 non-null float64 23 texture_worst 569 non-null float64 24 perimeter_worst 569 non-null float64 25 area_worst 569 non-null float64 26 smoothness_worst 569 non-null float64 27 compactness_worst 569 non-null float64 28 concavity_worst 569 non-null float64 29 concave points_worst 569 non-null float64 30 symmetry_worst 569 non-null float64 31 fractal_dimension_worst 569 non-null float64 dtypes: float64(30), int64(1), object(1) memory usage: 142.4+ KB
# Let's check for duplicate values in the data
data.duplicated().sum()
0
# Let's check for missing values in the data
round(data.isnull().sum() / data.isnull().count() * 100, 2)
id 0.0 diagnosis 0.0 radius_mean 0.0 texture_mean 0.0 perimeter_mean 0.0 area_mean 0.0 smoothness_mean 0.0 compactness_mean 0.0 concavity_mean 0.0 concave points_mean 0.0 symmetry_mean 0.0 fractal_dimension_mean 0.0 radius_se 0.0 texture_se 0.0 perimeter_se 0.0 area_se 0.0 smoothness_se 0.0 compactness_se 0.0 concavity_se 0.0 concave points_se 0.0 symmetry_se 0.0 fractal_dimension_se 0.0 radius_worst 0.0 texture_worst 0.0 perimeter_worst 0.0 area_worst 0.0 smoothness_worst 0.0 compactness_worst 0.0 concavity_worst 0.0 concave points_worst 0.0 symmetry_worst 0.0 fractal_dimension_worst 0.0 dtype: float64
data["diagnosis"].value_counts(normalize=True)
B 0.627417 M 0.372583 Name: diagnosis, dtype: float64
# Let's view the statistical summary of the numerical columns in the data
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| id | 569.0 | 3.037183e+07 | 1.250206e+08 | 8670.000000 | 869218.000000 | 906024.000000 | 8.813129e+06 | 9.113205e+08 |
| radius_mean | 569.0 | 1.412729e+01 | 3.524049e+00 | 6.981000 | 11.700000 | 13.370000 | 1.578000e+01 | 2.811000e+01 |
| texture_mean | 569.0 | 1.928965e+01 | 4.301036e+00 | 9.710000 | 16.170000 | 18.840000 | 2.180000e+01 | 3.928000e+01 |
| perimeter_mean | 569.0 | 9.196903e+01 | 2.429898e+01 | 43.790000 | 75.170000 | 86.240000 | 1.041000e+02 | 1.885000e+02 |
| area_mean | 569.0 | 6.548891e+02 | 3.519141e+02 | 143.500000 | 420.300000 | 551.100000 | 7.827000e+02 | 2.501000e+03 |
| smoothness_mean | 569.0 | 9.636028e-02 | 1.406413e-02 | 0.052630 | 0.086370 | 0.095870 | 1.053000e-01 | 1.634000e-01 |
| compactness_mean | 569.0 | 1.043410e-01 | 5.281276e-02 | 0.019380 | 0.064920 | 0.092630 | 1.304000e-01 | 3.454000e-01 |
| concavity_mean | 569.0 | 8.879932e-02 | 7.971981e-02 | 0.000000 | 0.029560 | 0.061540 | 1.307000e-01 | 4.268000e-01 |
| concave points_mean | 569.0 | 4.891915e-02 | 3.880284e-02 | 0.000000 | 0.020310 | 0.033500 | 7.400000e-02 | 2.012000e-01 |
| symmetry_mean | 569.0 | 1.811619e-01 | 2.741428e-02 | 0.106000 | 0.161900 | 0.179200 | 1.957000e-01 | 3.040000e-01 |
| fractal_dimension_mean | 569.0 | 6.279761e-02 | 7.060363e-03 | 0.049960 | 0.057700 | 0.061540 | 6.612000e-02 | 9.744000e-02 |
| radius_se | 569.0 | 4.051721e-01 | 2.773127e-01 | 0.111500 | 0.232400 | 0.324200 | 4.789000e-01 | 2.873000e+00 |
| texture_se | 569.0 | 1.216853e+00 | 5.516484e-01 | 0.360200 | 0.833900 | 1.108000 | 1.474000e+00 | 4.885000e+00 |
| perimeter_se | 569.0 | 2.866059e+00 | 2.021855e+00 | 0.757000 | 1.606000 | 2.287000 | 3.357000e+00 | 2.198000e+01 |
| area_se | 569.0 | 4.033708e+01 | 4.549101e+01 | 6.802000 | 17.850000 | 24.530000 | 4.519000e+01 | 5.422000e+02 |
| smoothness_se | 569.0 | 7.040979e-03 | 3.002518e-03 | 0.001713 | 0.005169 | 0.006380 | 8.146000e-03 | 3.113000e-02 |
| compactness_se | 569.0 | 2.547814e-02 | 1.790818e-02 | 0.002252 | 0.013080 | 0.020450 | 3.245000e-02 | 1.354000e-01 |
| concavity_se | 569.0 | 3.189372e-02 | 3.018606e-02 | 0.000000 | 0.015090 | 0.025890 | 4.205000e-02 | 3.960000e-01 |
| concave points_se | 569.0 | 1.179614e-02 | 6.170285e-03 | 0.000000 | 0.007638 | 0.010930 | 1.471000e-02 | 5.279000e-02 |
| symmetry_se | 569.0 | 2.054230e-02 | 8.266372e-03 | 0.007882 | 0.015160 | 0.018730 | 2.348000e-02 | 7.895000e-02 |
| fractal_dimension_se | 569.0 | 3.794904e-03 | 2.646071e-03 | 0.000895 | 0.002248 | 0.003187 | 4.558000e-03 | 2.984000e-02 |
| radius_worst | 569.0 | 1.626919e+01 | 4.833242e+00 | 7.930000 | 13.010000 | 14.970000 | 1.879000e+01 | 3.604000e+01 |
| texture_worst | 569.0 | 2.567722e+01 | 6.146258e+00 | 12.020000 | 21.080000 | 25.410000 | 2.972000e+01 | 4.954000e+01 |
| perimeter_worst | 569.0 | 1.072612e+02 | 3.360254e+01 | 50.410000 | 84.110000 | 97.660000 | 1.254000e+02 | 2.512000e+02 |
| area_worst | 569.0 | 8.805831e+02 | 5.693570e+02 | 185.200000 | 515.300000 | 686.500000 | 1.084000e+03 | 4.254000e+03 |
| smoothness_worst | 569.0 | 1.323686e-01 | 2.283243e-02 | 0.071170 | 0.116600 | 0.131300 | 1.460000e-01 | 2.226000e-01 |
| compactness_worst | 569.0 | 2.542650e-01 | 1.573365e-01 | 0.027290 | 0.147200 | 0.211900 | 3.391000e-01 | 1.058000e+00 |
| concavity_worst | 569.0 | 2.721885e-01 | 2.086243e-01 | 0.000000 | 0.114500 | 0.226700 | 3.829000e-01 | 1.252000e+00 |
| concave points_worst | 569.0 | 1.146062e-01 | 6.573234e-02 | 0.000000 | 0.064930 | 0.099930 | 1.614000e-01 | 2.910000e-01 |
| symmetry_worst | 569.0 | 2.900756e-01 | 6.186747e-02 | 0.156500 | 0.250400 | 0.282200 | 3.179000e-01 | 6.638000e-01 |
| fractal_dimension_worst | 569.0 | 8.394582e-02 | 1.806127e-02 | 0.055040 | 0.071460 | 0.080040 | 9.208000e-02 | 2.075000e-01 |
Apart from the id column and the diagnosis target, all other columns in the dataset are numerical.
# Let's check the number of unique values in each column
data.nunique()
id 569 diagnosis 2 radius_mean 456 texture_mean 479 perimeter_mean 522 area_mean 539 smoothness_mean 474 compactness_mean 537 concavity_mean 537 concave points_mean 542 symmetry_mean 432 fractal_dimension_mean 499 radius_se 540 texture_se 519 perimeter_se 533 area_se 528 smoothness_se 547 compactness_se 541 concavity_se 533 concave points_se 507 symmetry_se 498 fractal_dimension_se 545 radius_worst 457 texture_worst 511 perimeter_worst 514 area_worst 544 smoothness_worst 411 compactness_worst 529 concavity_worst 539 concave points_worst 492 symmetry_worst 500 fractal_dimension_worst 535 dtype: int64
for i in data.describe(include=["object"]).columns:
print("Unique values in", i, "are :")
print(data[i].value_counts())
print("*" * 50)
Unique values in diagnosis are : B 357 M 212 Name: diagnosis, dtype: int64 **************************************************
The following helper functions are defined to carry out the Exploratory Data Analysis.
# Function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
labeled_barplot(data, "diagnosis",perc=True)
def plot_features(feature_list, title):
plt.figure(figsize=(15, 10))
for i, feature in enumerate(feature_list):
plt.subplot(3, 4, i+1)
sns.histplot(data[feature], kde=True, bins=20)
plt.title(f'Distribution of {feature}')
plt.suptitle(title, fontsize=16)
plt.tight_layout()
plt.show()
# Defining a list of features to visualize
features_mean = ['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean']
plot_features(features_mean, 'Mean features')
Skewness: Several features such as radius_mean, perimeter_mean, area_mean, compactness_mean, concavity_mean, concave points_mean, and fractal_dimension_mean exhibit right-skewness. This shows that there are a few large values that pull the mean to the right.
Normal-like Distribution: Features like texture_mean, smoothness_mean, and symmetry_mean appear more normally distributed, suggesting they have a more balanced spread around their mean values.
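The skewness observations above can be checked numerically with pandas' `skew()`, where positive values indicate a right tail and values near zero a symmetric spread. A self-contained sketch on toy data (in the notebook, the equivalent call would be `data[features_mean].skew()`):

```python
import pandas as pd

# skew() returns sample skewness: positive -> right-skewed, ~0 -> symmetric
toy = pd.DataFrame({
    "right_skewed": [1, 1, 2, 2, 3, 3, 4, 20],  # long right tail from one large value
    "symmetric":    [1, 2, 3, 4, 5, 6, 7, 8],   # evenly spread around the mean
})
print(toy.skew())
```

Features with skewness well above zero are candidates for the log or Box-Cox transformations applied later in the notebook.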
features_se = ['radius_se', 'texture_se', 'perimeter_se', 'area_se',
'smoothness_se', 'compactness_se', 'concavity_se',
'concave points_se', 'symmetry_se', 'fractal_dimension_se']
plot_features(features_se, 'SE Features')
All of these distributions are right-skewed, so we can conclude that the *_se features, which represent standard errors, tend to cluster towards the lower end of their respective scales, with a few outliers.
features_worst = ['radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
'smoothness_worst', 'compactness_worst', 'concavity_worst',
'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
plot_features(features_worst, 'Worst Features')
Worst features tend to have a broader range of values compared to the *_se features, with some features showing normal-like distributions and others being right-skewed with more variability.
selected_features = ['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'diagnosis']
sns.pairplot(data[selected_features], hue='diagnosis', palette='coolwarm')
plt.show()
# Correlation heatmap of all features
plt.figure(figsize=(16, 12))
corr = data.iloc[:, 1:].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
Strong Positive Correlations:
- radius_mean, perimeter_mean, and area_mean have very strong positive correlations with each other, close to 1.0.
- radius_mean, radius_worst, perimeter_worst, and area_worst also show very strong positive correlations, indicating that larger radii, perimeters, and areas are consistent across the mean and worst measurements.
- concave points_mean, concavity_mean, and compactness_mean exhibit strong positive correlations with each other. radius_worst, perimeter_worst, and area_worst correlate highly with their respective mean values, indicating consistency in these measurements.
Low or No Correlations:
- symmetry_mean and smoothness_mean generally have low correlations with most other features, suggesting these features are relatively independent.
- fractal_dimension_mean and fractal_dimension_se also show low correlations with most other features, apart from some moderate negative correlations with size-related features such as radius_mean.
Conclusion
The strong correlations among radius_mean, perimeter_mean, and area_mean suggest that these features are highly interrelated and may provide redundant information.
The clustering of features can be useful for dimensionality reduction techniques like PCA, where highly correlated features can be combined into single principal components.
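As a hedged sketch of the PCA idea mentioned above (synthetic correlated columns, not the tumor data): when features are nearly linear functions of one another, a single principal component captures almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
base = rng.normal(size=200)
# Three strongly correlated columns, mimicking radius/perimeter/area
X_corr = np.column_stack([
    base,
    2.0 * base + rng.normal(scale=0.05, size=200),
    -1.5 * base + rng.normal(scale=0.05, size=200),
])
X_scaled = StandardScaler().fit_transform(X_corr)  # PCA is scale-sensitive
pca = PCA(n_components=3).fit(X_scaled)
# The first component captures almost all of the variance
print(pca.explained_variance_ratio_)
```

Applied to this dataset, the same pattern would let the radius/perimeter/area cluster be summarized by one component instead of three redundant features.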
data.drop(columns="id", inplace=True)
# Encoding the target variable
label_encoder = LabelEncoder()
data['diagnosis'] = label_encoder.fit_transform(data['diagnosis'])
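One caveat worth checking: `LabelEncoder` assigns integer codes in alphabetical order of the class labels, so here B (benign) maps to 0 and M (malignant) to 1. A minimal self-contained sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["M", "B", "M", "B", "B"])
print(le.classes_)  # classes are sorted alphabetically: B before M
print(codes)        # B -> 0, M -> 1
```

This matters for the metrics later: with M encoded as 1, recall and precision with default settings refer to the malignant class, which is the behavior we want.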
## Separating Independent and Dependent Columns
X = data.drop(['diagnosis'],axis=1)
Y = data['diagnosis']
X.columns
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
'smoothness_mean', 'compactness_mean', 'concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst',
'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst'],
dtype='object')
# Calculating the total number of NaN values in each column
X.isnull().sum()
radius_mean 0 texture_mean 0 perimeter_mean 0 area_mean 0 smoothness_mean 0 compactness_mean 0 concavity_mean 0 concave points_mean 0 symmetry_mean 0 fractal_dimension_mean 0 radius_se 0 texture_se 0 perimeter_se 0 area_se 0 smoothness_se 0 compactness_se 0 concavity_se 0 concave points_se 0 symmetry_se 0 fractal_dimension_se 0 radius_worst 0 texture_worst 0 perimeter_worst 0 area_worst 0 smoothness_worst 0 compactness_worst 0 concavity_worst 0 concave points_worst 0 symmetry_worst 0 fractal_dimension_worst 0 dtype: int64
Q1 = data.quantile(0.25)  # 25th percentile
Q3 = data.quantile(0.75)  # 75th percentile
IQR = Q3 - Q1  # Inter-quartile range (75th percentile - 25th percentile)
lower = (
Q1 - 1.5 * IQR
) # Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper = Q3 + 1.5 * IQR
(
(data.select_dtypes(include=["float64", "int64"]) < lower)
| (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100
diagnosis 0.000 radius_mean 2.460 texture_mean 1.230 perimeter_mean 2.285 area_mean 4.394 smoothness_mean 1.054 compactness_mean 2.812 concavity_mean 3.163 concave points_mean 1.757 symmetry_mean 2.636 fractal_dimension_mean 2.636 radius_se 6.678 texture_se 3.515 perimeter_se 6.678 area_se 11.424 smoothness_se 5.272 compactness_se 4.921 concavity_se 3.866 concave points_se 3.339 symmetry_se 4.745 fractal_dimension_se 4.921 radius_worst 2.988 texture_worst 0.879 perimeter_worst 2.636 area_worst 6.151 smoothness_worst 1.230 compactness_worst 2.812 concavity_worst 2.109 concave points_worst 0.000 symmetry_worst 4.042 fractal_dimension_worst 4.218 dtype: float64
After identifying outliers, we can decide whether to remove or treat them. Here we choose not to treat them, since outliers will also occur in real-world cases and we want the model to learn the underlying pattern for such patients.
Pearson Correlation Test
Rationale: To quantify the strength and direction of linear relationships between numerical variables.
Implementation: We perform a Pearson correlation test between each numerical feature and the encoded diagnosis to quantify the strength of its linear association with the target.
features = data.columns.difference(['id', 'diagnosis'])
from scipy.stats import pearsonr, boxcox
from scipy.stats import ttest_ind
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
correlations = {}
for feature in features:
corr, _ = pearsonr(data[feature], data['diagnosis'])
correlations[feature] = corr
print("Pearson Correlation Coefficients:")
print(correlations)
Pearson Correlation Coefficients:
{'area_mean': 0.70898383658539, 'area_se': 0.5482359402780242, 'area_worst': 0.7338250349210509, 'compactness_mean': 0.5965336775082533, 'compactness_se': 0.2929992442488583, 'compactness_worst': 0.590998237841792, 'concave points_mean': 0.7766138400204355, 'concave points_se': 0.4080423327165046, 'concave points_worst': 0.79356601714127, 'concavity_mean': 0.6963597071719059, 'concavity_se': 0.2537297659808304, 'concavity_worst': 0.659610210369233, 'fractal_dimension_mean': -0.012837602698432374, 'fractal_dimension_se': 0.07797241739025611, 'fractal_dimension_worst': 0.32387218872082396, 'perimeter_mean': 0.742635529725833, 'perimeter_se': 0.5561407034314831, 'perimeter_worst': 0.7829141371737594, 'radius_mean': 0.7300285113754563, 'radius_se': 0.5671338208247176, 'radius_worst': 0.7764537785950394, 'smoothness_mean': 0.35855996508593213, 'smoothness_se': -0.06701601057948735, 'smoothness_worst': 0.42146486106640263, 'symmetry_mean': 0.3304985542625471, 'symmetry_se': -0.006521755870647955, 'symmetry_worst': 0.4162943110486191, 'texture_mean': 0.41518529984520436, 'texture_se': -0.008303332973877408, 'texture_worst': 0.45690282139679816}
Concave Points and Perimeter: concave points_mean (0.777), concave points_worst (0.794), and concave points_se (0.408) have strong positive correlations with perimeter_mean (0.743), perimeter_worst (0.783), and perimeter_se (0.556). This indicates that as the concave points increase, the perimeter of the tumor also increases.
Radius and Perimeter: radius_mean (0.730), radius_worst (0.776), and radius_se (0.567) show strong positive correlations with perimeter_mean, perimeter_worst, and perimeter_se. This suggests that tumors with larger radii tend to have larger perimeters.
Highly Correlated Features: Concave points, radius, and perimeter show strong positive correlations, indicating these features are related to the size and shape of the tumor.
Moderately Correlated Features: Compactness and area also show moderate to strong correlations, suggesting their importance in tumor characterization.
Weakly Correlated Features: Fractal dimension and symmetry generally have lower correlations, indicating they may not be as strongly associated with other features in this dataset.
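To make the interpretation above systematic, the coefficients can be ranked by absolute value. A sketch using a small subset of the values printed above (the subset choice is illustrative, not exhaustive):

```python
import pandas as pd

# Subset of the Pearson coefficients from the output cell above
corr_sample = pd.Series({
    "concave points_worst": 0.7936,
    "perimeter_worst": 0.7829,
    "area_mean": 0.7090,
    "smoothness_se": -0.0670,
    "fractal_dimension_mean": -0.0128,
})
# Sort by absolute correlation: strongest association with diagnosis first
ranked = corr_sample.abs().sort_values(ascending=False)
print(ranked)
```

Ranking by absolute value keeps strong negative correlations visible, which a plain descending sort of signed coefficients would push to the bottom.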
# Defining a function to visualize feature distributions before and after transformation
def visualize_transformations(feature_name, t):
    """
    Plot the original distribution next to its transformed version.
    feature_name: column to visualize
    t: 0 -> show the log-transformed column, 1 -> show the Box-Cox-transformed column
    """
    plt.figure(figsize=(15, 5))

    # Original distribution
    plt.subplot(1, 3, 1)
    sns.histplot(data[feature_name], bins=20, kde=True, color='skyblue')
    plt.title(f'Original {feature_name}')
    plt.xlabel(feature_name)
    plt.ylabel('Frequency')

    if t == 0:
        # Log-transformed distribution
        plt.subplot(1, 3, 2)
        sns.histplot(data[f'{feature_name}_log'], bins=20, kde=True, color='salmon')
        plt.title(f'Log Transformed {feature_name}')
        plt.xlabel(f'Log {feature_name}')
        plt.ylabel('Frequency')

    if t == 1:
        # Box-Cox-transformed distribution
        plt.subplot(1, 3, 2)
        sns.histplot(data[f'{feature_name}_boxcox'], bins=20, kde=True, color='green')
        plt.title(f'Box-Cox Transformed {feature_name}')
        plt.xlabel(f'Box-Cox {feature_name}')
        plt.ylabel('Frequency')

    plt.tight_layout()
    plt.show()
# Log transformation
data['area_mean_log'] = np.log(data['area_mean'] + 1) # Adding 1 to avoid log(0)
data['area_se_log'] = np.log(data['area_se'] + 1)
data['area_worst_log'] = np.log(data['area_worst'] + 1)
visualize_transformations("area_mean",0)
visualize_transformations("area_se",0)
visualize_transformations("area_worst",0)
# Box-Cox transformation (requires strictly positive input, hence the +1 shift)
from scipy.stats import boxcox

data['area_mean_boxcox'], _ = boxcox(data['area_mean'] + 1)
data['area_se_boxcox'], _ = boxcox(data['area_se'] + 1)
data['area_worst_boxcox'], _ = boxcox(data['area_worst'] + 1)
visualize_transformations("area_mean",1)
visualize_transformations("area_se",1)
visualize_transformations("area_worst",1)
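A quick numeric check that these transformations help is to compare skewness before and after. This sketch uses a synthetic right-skewed variable rather than the actual `area_mean` column:

```python
import numpy as np
from scipy.stats import boxcox, skew

# Synthetic right-skewed variable (lognormal), standing in for an area-like feature
rng = np.random.default_rng(7)
area = rng.lognormal(mean=6.5, sigma=0.5, size=500)

log_t = np.log(area + 1)        # +1 guards against log(0)
bc_t, lam = boxcox(area + 1)    # Box-Cox estimates the power lambda itself

print(round(skew(area), 2), round(skew(log_t), 2), round(skew(bc_t), 2))
```

Both transforms pull the skewness toward zero; Box-Cox usually gets closest because it fits the exponent to the data instead of fixing it at the logarithm.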
# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42, stratify=Y)

# Splitting the training set further into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

# Scale the features (fit on the training set only to avoid data leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_valid = scaler.transform(X_valid)

# Printing the shapes
print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)
(364, 30) (364,)
(91, 30) (91,)
(114, 30) (114,)
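The scaler above is fit on the training split only. The sketch below, on synthetic data, shows why this matters: the test set correctly keeps a non-zero mean after scaling when its distribution differs from the training set's, which would be hidden if the scaler had seen the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(10, 2, size=(100, 3))   # stand-in training features
X_te = rng.normal(12, 2, size=(25, 3))    # test features from a shifted distribution

scaler = StandardScaler().fit(X_tr)       # statistics come from the training data only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Training data is centred; the shifted test data keeps a positive mean
print(round(X_tr_s.mean(), 6), round(X_te_s.mean(), 2))
```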
def plot(history, name):
    """
    Function to plot loss/accuracy
    history: an object which stores the metrics and losses
    name: one of 'loss' or 'accuracy'
    """
    fig, ax = plt.subplots()                      # create a figure and axes
    plt.plot(history.history[name])               # training metric
    plt.plot(history.history['val_' + name])      # validation metric
    plt.title('Model ' + name.capitalize())
    plt.ylabel(name.capitalize())
    plt.xlabel('Epoch')
    fig.legend(['Train', 'Validation'], loc="outside right upper")  # loc controls the legend position
# defining a function to compute different metrics to check performance of a classification model built using TensorFlow/Keras
def model_performance_classification(model, predictors, target, threshold=0.5):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: probability threshold for classifying an observation as class 1
    """
    # classifying observations whose predicted probability exceeds the threshold
    pred = model.predict(predictors) > threshold

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred, average="weighted")  # to compute Recall
    precision = precision_score(target, pred, average="weighted")  # to compute Precision
    f1 = f1_score(target, pred, average="weighted")  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1 Score": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with counts and percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)          # to compute Accuracy
    recall = recall_score(target, pred)         # to compute Recall
    precision = precision_score(target, pred)   # to compute Precision
    f1 = f1_score(target, pred)                 # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
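As a sanity check on these metrics, here is a hand-checkable toy case, with six labels chosen so that TP=2, FN=1, FP=1, TN=2:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Six labels chosen so that TP=2, FN=1, FP=1, TN=2
target = [1, 1, 1, 0, 0, 0]
pred   = [1, 1, 0, 0, 0, 1]

print(accuracy_score(target, pred))    # (TP+TN)/6 = 4/6
print(recall_score(target, pred))      # TP/(TP+FN) = 2/3
print(precision_score(target, pred))   # TP/(TP+FP) = 2/3
print(f1_score(target, pred))          # harmonic mean of the two = 2/3
```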
The model can make two kinds of wrong predictions in benign vs. malignant classification:

1. Predicting a tumor is malignant when it is actually benign (false positive)
2. Predicting a tumor is benign when it is actually malignant (false negative)

Which case is more important?

- A false negative is far more costly: a malignant tumor that goes undetected delays treatment and directly endangers the patient, whereas a false positive only leads to additional (avoidable) testing.

How to reduce this loss, i.e. reduce false negatives?

- Maximize recall: the higher the recall, the lower the chance of classifying a malignant tumor as benign, so recall is used as the primary metric for model comparison and tuning.
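One simple lever for trading false negatives against false positives is the classification threshold: lowering it flags more cases as malignant and raises recall at the cost of precision. A toy sketch with made-up probabilities:

```python
import numpy as np
from sklearn.metrics import recall_score

# Made-up predicted probabilities for 8 cases (1 = malignant)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
proba  = np.array([0.9, 0.7, 0.45, 0.3, 0.6, 0.4, 0.2, 0.1])

# Lowering the threshold converts borderline cases into positives,
# catching more true malignancies (higher recall)
for thr in (0.5, 0.3):
    pred = (proba > thr).astype(int)
    print(f"threshold={thr}: recall={recall_score(y_true, pred):.2f}")
```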
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
print("\nTraining Performance:\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")
for name, model in models:
    # models are already fitted above, so no refit is needed
    scores_val = recall_score(y_valid, model.predict(X_valid))
    print("{}: {}".format(name, scores_val))
Training Performance:

Logistic regression: 0.9705882352941176
Bagging: 0.9779411764705882
Random forest: 1.0
GBM: 1.0
Adaboost: 1.0
Xgboost: 1.0
dtree: 1.0

Validation Performance:

Logistic regression: 0.9705882352941176
Bagging: 0.9117647058823529
Random forest: 0.9411764705882353
GBM: 0.9411764705882353
Adaboost: 0.9411764705882353
Xgboost: 0.9411764705882353
dtree: 0.9117647058823529
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 136
Before Oversampling, counts of label 'No': 228

After Oversampling, counts of label 'Yes': 228
After Oversampling, counts of label 'No': 228

After Oversampling, the shape of train_X: (456, 30)
After Oversampling, the shape of train_y: (456,)
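To see the class-balancing arithmetic without the `imblearn` dependency, this sketch performs plain random oversampling on a synthetic set with the same 228/136 split. Note the difference from SMOTE, which interpolates new synthetic rows between a minority point and its k nearest minority neighbours rather than duplicating existing rows:

```python
import numpy as np

# Synthetic imbalanced training set with the same 228 (benign) / 136 (malignant) split
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(364, 30))
y_toy = np.array([0] * 228 + [1] * 136)

# Plain random oversampling: duplicate minority rows until the classes match
minority_idx = np.where(y_toy == 1)[0]
extra = rng.choice(minority_idx, size=228 - 136, replace=True)
X_toy_over = np.vstack([X_toy, X_toy[extra]])
y_toy_over = np.concatenate([y_toy, y_toy[extra]])

print(np.bincount(y_toy_over), X_toy_over.shape)   # [228 228] (456, 30)
```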
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
print("\nTraining Performance:\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")
for name, model in models:
    # models are already fitted on the oversampled data above
    scores = recall_score(y_valid, model.predict(X_valid))
    print("{}: {}".format(name, scores))
Training Performance:

Logistic regression: 0.9780701754385965
Bagging: 1.0
Random forest: 1.0
GBM: 1.0
Adaboost: 1.0
Xgboost: 1.0
dtree: 1.0

Validation Performance:

Logistic regression: 0.9705882352941176
Bagging: 0.9117647058823529
Random forest: 0.9411764705882353
GBM: 0.9705882352941176
Adaboost: 0.9705882352941176
Xgboost: 0.9705882352941176
dtree: 0.9705882352941176
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label '1': 136
Before Under Sampling, counts of label '0': 228

After Under Sampling, counts of label '1': 136
After Under Sampling, counts of label '0': 136

After Under Sampling, the shape of train_X: (272, 30)
After Under Sampling, the shape of train_y: (272,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic regression", LogisticRegression(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("Xgboost", XGBClassifier(random_state=1, eval_metric="logloss")))
models.append(("dtree", DecisionTreeClassifier(random_state=1)))
print("\nTraining Performance:\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")
for name, model in models:
    # models are already fitted on the undersampled data above
    scores = recall_score(y_valid, model.predict(X_valid))
    print("{}: {}".format(name, scores))
Training Performance:

Logistic regression: 0.9779411764705882
Bagging: 1.0
Random forest: 1.0
GBM: 1.0
Adaboost: 1.0
Xgboost: 1.0
dtree: 1.0

Validation Performance:

Logistic regression: 0.9705882352941176
Bagging: 0.9411764705882353
Random forest: 0.9411764705882353
GBM: 0.9705882352941176
Adaboost: 0.9411764705882353
Xgboost: 0.9705882352941176
dtree: 0.8823529411764706
As per the validation performance, XGBoost has the best recall, followed closely by Logistic Regression; both are strong candidates for hyperparameter tuning.
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    'n_estimators': np.arange(50, 300, 50),
    'scale_pos_weight': [0, 1, 2, 5, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.05],
    'gamma': [0, 1, 3, 5],
    'subsample': [0.7, 0.8, 0.9, 1],
}
from sklearn import metrics
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 2, 'n_estimators': 50, 'learning_rate': 0.01, 'gamma': 1} with CV score=0.9851851851851852:
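The same search pattern in miniature, on synthetic data and with a fast `LogisticRegression` stand-in, shows how `param_distributions`, `n_iter`, and the recall scorer fit together:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary problem standing in for the tumor features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000, random_state=1),
    param_distributions={"C": np.logspace(-3, 2, 20)},  # sampled, not exhaustively searched
    n_iter=8,                                           # only 8 of the 20 candidates are tried
    scoring=make_scorer(recall_score),                  # the same recall-first criterion used above
    cv=5,
    random_state=1,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Unlike `GridSearchCV`, the randomized variant evaluates only `n_iter` sampled combinations, which is what makes the large XGBoost grid above tractable.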
tuned_xgb1 = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.9,
scale_pos_weight=10,
n_estimators=200,
learning_rate=0.01,
gamma=1,
)
tuned_xgb1.fit(X_train_un, y_train_un)
# Checking model's performance on train set
xgb1_train = model_performance_classification_sklearn(
tuned_xgb1, X_train_un, y_train_un
)
xgb1_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.982 | 1.000 | 0.965 | 0.982 |
# Checking model's performance on validation set
xgb1_val = model_performance_classification_sklearn(tuned_xgb1, X_valid, y_valid)
xgb1_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.912 | 0.971 | 0.825 | 0.892 |
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    'n_estimators': np.arange(50, 300, 50),
    'scale_pos_weight': [0, 1, 2, 5, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.05],
    'gamma': [0, 1, 3, 5],
    'subsample': [0.7, 0.8, 0.9, 1],
}
from sklearn import metrics
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 10, 'n_estimators': 150, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.9703703703703704:
tuned_xgb2 = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.9,
scale_pos_weight=10,
n_estimators=200,
learning_rate=0.01,
gamma=1,
)
tuned_xgb2.fit(X_train, y_train)
# Checking model's performance on training set
xgb2_train = model_performance_classification_sklearn(tuned_xgb2, X_train, y_train)
xgb2_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.986 | 1.000 | 0.965 | 0.982 |
# Checking model's performance on validation set
xgb2_val = model_performance_classification_sklearn(tuned_xgb2, X_valid, y_valid)
xgb2_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.945 | 0.971 | 0.892 | 0.930 |
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 20, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9632275132275133:
tuned_adb1 = AdaBoostClassifier(
random_state=1,
n_estimators=90,
learning_rate=0.2,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_adb1.fit(X_train_un, y_train_un)
# Checking model's performance on training set
adb1_train = model_performance_classification_sklearn(
tuned_adb1, X_train_un, y_train_un
)
adb1_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# Checking model's performance on validation set
adb1_val = model_performance_classification_sklearn(tuned_adb1, X_valid, y_valid)
adb1_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.934 | 0.971 | 0.868 | 0.917 |
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 90, 'learning_rate': 1, 'base_estimator': DecisionTreeClassifier(max_depth=2, random_state=1)} with CV score=0.9484126984126984:
tuned_adb2 = AdaBoostClassifier(
random_state=1,
n_estimators=90,
learning_rate=0.2,
base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_adb2.fit(X_train, y_train)
# Checking model's performance on training set
adb2_train = model_performance_classification_sklearn(tuned_adb2, X_train, y_train)
adb2_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# Checking model's performance on validation set
adb2_val = model_performance_classification_sklearn(tuned_adb2, X_valid, y_valid)
adb2_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.890 | 0.941 | 0.800 | 0.865 |
#defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(75, 150, 25),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "subsample": [0.5, 0.7, 1],
    "max_features": [0.5, 0.7, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 125, 'max_features': 0.5, 'learning_rate': 0.05, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9703703703703704:
tuned_gbm1 = GradientBoostingClassifier(
random_state=1,
subsample=0.7,
n_estimators=125,
max_features=0.7,
learning_rate=0.2,
init=AdaBoostClassifier(random_state=1),
)
tuned_gbm1.fit(X_train_un, y_train_un)
# Checking model's performance on training set
gbm1_train = model_performance_classification_sklearn(
tuned_gbm1, X_train_un, y_train_un
)
gbm1_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.000 | 1.000 | 1.000 | 1.000 |
# Checking model's performance on validation set
gbm1_val = model_performance_classification_sklearn(tuned_gbm1, X_valid, y_valid)
gbm1_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.945 | 0.941 | 0.914 | 0.928 |
#defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass into RandomizedSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(75, 150, 25),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "subsample": [0.5, 0.7, 1],
    "max_features": [0.5, 0.7, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'n_estimators': 75, 'max_features': 1, 'learning_rate': 0.2, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9407407407407407:
tuned_gbm2 = GradientBoostingClassifier(
random_state=1,
subsample=0.7,
n_estimators=125,
max_features=0.7,
learning_rate=0.2,
init=AdaBoostClassifier(random_state=1),
)
tuned_gbm2.fit(X_train, y_train)
# Checking model's performance on training set
gbm2_train = model_performance_classification_sklearn(tuned_gbm2, X_train, y_train)
gbm2_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.992 | 1.000 | 0.978 | 0.989 |
# Checking model's performance on validation set
gbm2_val = model_performance_classification_sklearn(tuned_gbm2, X_valid, y_valid)
gbm2_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.945 | 0.941 | 0.914 | 0.928 |
# training performance comparison
models_train_comp_df = pd.concat(
[
xgb1_train.T,
xgb2_train.T,
gbm1_train.T,
gbm2_train.T,
adb1_train.T,
adb2_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"XGBoost trained with Undersampled data",
"XGBoost trained with Original data",
"Gradient boosting trained with Undersampled data",
"Gradient boosting trained with Original data",
"AdaBoost trained with Undersampled data",
"AdaBoost trained with Original data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | XGBoost trained with Undersampled data | XGBoost trained with Original data | Gradient boosting trained with Undersampled data | Gradient boosting trained with Original data | AdaBoost trained with Undersampled data | AdaBoost trained with Original data |
|---|---|---|---|---|---|---|
| Accuracy | 0.982 | 0.986 | 1.000 | 0.992 | 1.000 | 1.000 |
| Recall | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Precision | 0.965 | 0.965 | 1.000 | 0.978 | 1.000 | 1.000 |
| F1 | 0.982 | 0.982 | 1.000 | 0.989 | 1.000 | 1.000 |
# Validation performance comparison
models_val_comp_df = pd.concat(
    [xgb1_val.T, xgb2_val.T, gbm1_val.T, gbm2_val.T, adb1_val.T, adb2_val.T], axis=1,
)
models_val_comp_df.columns = [
    "XGBoost trained with Undersampled data",
    "XGBoost trained with Original data",
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Original data",
    "AdaBoost trained with Undersampled data",
    "AdaBoost trained with Original data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| | XGBoost trained with Undersampled data | XGBoost trained with Original data | Gradient boosting trained with Undersampled data | Gradient boosting trained with Original data | AdaBoost trained with Undersampled data | AdaBoost trained with Original data |
|---|---|---|---|---|---|---|
| Accuracy | 0.912 | 0.945 | 0.945 | 0.945 | 0.934 | 0.890 |
| Recall | 0.971 | 0.971 | 0.941 | 0.941 | 0.971 | 0.941 |
| Precision | 0.825 | 0.892 | 0.914 | 0.914 | 0.868 | 0.800 |
| F1 | 0.892 | 0.930 | 0.928 | 0.928 | 0.917 | 0.865 |
# Let's check the performance on test set
gbm2_test = model_performance_classification_sklearn(tuned_gbm2, X_test, y_test)
gbm2_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.974 | 0.929 | 1.000 | 0.963 |
# Initialize the list of models
models = [
("Logistic Regression", LogisticRegression(random_state=1)),
("Bagging", BaggingClassifier(random_state=1)),
("Random Forest", RandomForestClassifier(random_state=1)),
("Gradient Boosting", GradientBoostingClassifier(random_state=1)),
("AdaBoost", AdaBoostClassifier(random_state=1)),
("XGBoost", XGBClassifier(random_state=1, eval_metric="logloss")),
("Decision Tree", DecisionTreeClassifier(random_state=1))
]
# Train the models and store their predictions
y_train_pred = {}
y_valid_pred = {}
y_valid_prob = {}
for name, model in models:
    model.fit(X_train, y_train)
    y_train_pred[name] = model.predict(X_train)
    y_valid_pred[name] = model.predict(X_valid)
    if hasattr(model, "predict_proba"):
        y_valid_prob[name] = model.predict_proba(X_valid)[:, 1]
    else:
        y_valid_prob[name] = model.decision_function(X_valid)
# Plot Confusion Matrix for each model in a grid view
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 15))
fig.suptitle('Confusion Matrices for All Models', fontsize=16)
axes = axes.flatten()
for i, (name, model) in enumerate(models):
    cm = confusion_matrix(y_valid, y_valid_pred[name])
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot(ax=axes[i], values_format='d')
    axes[i].set_title(name)

# Hide the unused subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
Overall Performance: All models misclassify only a handful of validation cases, which is consistent with the high recall scores seen earlier.

Best Models: Logistic Regression, XGBoost, and Gradient Boosting produce the fewest false negatives on the validation set, making them the strongest candidates for this task.

Least Performing Model: The Decision Tree shows the most misclassifications, in line with its lower validation recall.

These observations provide valuable insights into the strengths and weaknesses of each model.
# Plot ROC Curves for each model in a single plot
plt.figure(figsize=(10, 8))
for name, model in models:
    fpr, tpr, _ = roc_curve(y_valid, y_valid_prob[name])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves for All Models')
plt.legend(loc='lower right')
plt.show()
The ROC curve plot indicates that the Logistic Regression, Random Forest, Gradient Boosting, and XGBoost models are the top performers with an AUC of 0.99. These models are highly effective in distinguishing between benign and malignant cancer cases. The Bagging and AdaBoost classifiers also perform well but slightly lower than the top-tier models. The Decision Tree, while still performing well with an AUC of 0.90, is the least effective among the models evaluated.
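The AUC values quoted above summarize ranking quality: a model whose scores place every malignant case above every benign one reaches 1.0, while random ranking hovers around 0.5. A minimal illustration with made-up scores:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Scores that rank every positive (malignant) case above every negative one
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

# Perfect separation yields the maximum area under the ROC curve
fpr, tpr, _ = roc_curve(y_true, scores)
print(auc(fpr, tpr))   # 1.0
```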
# Plot Precision-Recall Curves for each model in a single plot
plt.figure(figsize=(10, 8))
for name, model in models:
    precision, recall, _ = precision_recall_curve(y_valid, y_valid_prob[name])
    pr_auc = auc(recall, precision)
    plt.plot(recall, precision, label=f'{name} (AUC = {pr_auc:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curves for All Models')
plt.legend(loc='lower left')
plt.show()
The plot indicates that Logistic Regression, Gradient Boosting, Random Forest, and XGBoost are the most effective models for classifying benign and malignant cancer cases, achieving high precision and recall with minimal trade-offs. AdaBoost is also effective but slightly less so, while Bagging shows more noticeable trade-offs. The Decision Tree model performs the worst, with a significant drop in precision as recall increases, making it less reliable for this classification task.
feature_names = X.columns  # the 30 diagnostic features the model was trained on (data.columns would misalign the labels)
importances = tuned_gbm2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
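For context, `feature_importances_` on a gradient boosting model is impurity-based: a non-negative value per feature that sums to 1 across features. A standalone sketch on synthetic data (the names here are illustrative, not the project's `tuned_gbm2`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; tuned_gbm2 above exposes the same attribute
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=42)
gbm = GradientBoostingClassifier(random_state=42).fit(X, y)
importances = gbm.feature_importances_  # non-negative, sums to 1
```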
As we are dealing with an imbalance in the class distribution, we will use class weights so that the model gives proportionally more importance to the minority class.
# Calculate class weights for the imbalanced dataset:
# weight of a class = total number of samples / number of samples in that class
cw = y_train.shape[0] / np.bincount(y_train)
# Map each class index to its weight, as expected by Keras' class_weight argument
cw_dict = {i: cw[i] for i in range(cw.shape[0])}
cw_dict
{0: 1.5964912280701755, 1: 2.676470588235294}
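For reference, scikit-learn ships an equivalent utility. Its "balanced" heuristic additionally divides by the number of classes, so the absolute weights come out at half the manual values above, but the class ratio, which is what matters to the model, is identical. The labels below are a toy stand-in reproducing the 228/136 split implied by the weights above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels reproducing the 228 majority / 136 minority split seen above
y = np.array([0] * 228 + [1] * 136)

weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
cw_balanced = dict(zip(np.unique(y), weights))
# "balanced" uses n_samples / (n_classes * class_count): half the manual
# n_samples / class_count values above, with the same class ratio
```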
# defining the batch size and number of epochs upfront, as we'll use the same values for all models
epochs = 50
batch_size = 64
Let's start with a neural network consisting of two hidden layers with 14 and 7 neurons respectively, using the ReLU activation function and SGD as the optimizer.
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(14,activation="relu",input_dim=X_train.shape[1]))
model.add(Dense(7,activation="relu"))
model.add(Dense(1,activation="sigmoid"))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 14) 434
dense_1 (Dense) (None, 7) 105
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 547 (2.14 KB)
Trainable params: 547 (2.14 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.SGD() # defining SGD as the optimizer to be used
model.compile(loss='binary_crossentropy', optimizer=optimizer)
start = time.time()
history = model.fit(X_train, y_train, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/50 6/6 [==============================] - 1s 50ms/step - loss: 1.4252 - val_loss: 0.6814
...
Epoch 50/50 6/6 [==============================] - 0s 16ms/step - loss: 0.2245 - val_loss: 0.1321
print("Time taken in seconds ",end-start)
Time taken in seconds 4.618805885314941
plot(history,'loss')
model_0_train_perf = model_performance_classification(model, X_train, y_train)
model_0_train_perf
12/12 [==============================] - 0s 2ms/step
|   | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.972527 | 0.972527 | 0.972669 | 0.972567 |
model_0_valid_perf = model_performance_classification(model, X_valid, y_valid)
model_0_valid_perf
4/4 [==============================] - 0s 3ms/step
|   | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.95614 | 0.95614 | 0.956869 | 0.955776 |
A train F1 score of ~0.97 and a validation F1 score of ~0.96 indicate consistent model performance between the training and validation datasets.
Even though the scores are good, the rate of improvement across epochs is slow.
Let's add momentum to see whether it accelerates the learning process.
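Conceptually, momentum keeps a running "velocity" that accumulates past gradients, which speeds up progress when plain SGD crawls with a small learning rate. A minimal sketch on the one-dimensional objective f(w) = w² (illustrative only, not Keras internals):

```python
# SGD with momentum vs. plain SGD on f(w) = w**2, same small learning rate
lr, beta = 0.01, 0.9

w_sgd = 5.0          # plain SGD iterate
w_mom, v = 5.0, 0.0  # momentum iterate and its velocity
for _ in range(100):
    w_sgd -= lr * (2 * w_sgd)        # plain SGD step (gradient of w**2 is 2w)
    v = beta * v - lr * (2 * w_mom)  # velocity accumulates past gradients
    w_mom += v
```

After 100 steps the momentum run ends much closer to the minimum at 0 than plain SGD, which is the kind of acceleration we hope to see in the training curves.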
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(14,activation="relu",input_dim=X_train.shape[1]))
model.add(Dense(7,activation="relu"))
model.add(Dense(1,activation="sigmoid"))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 14) 434
dense_1 (Dense) (None, 7) 105
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 547 (2.14 KB)
Trainable params: 547 (2.14 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.SGD(momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=optimizer)
start = time.time()
history = model.fit(X_train, y_train, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight = cw_dict)
end=time.time()
Epoch 1/50 6/6 [==============================] - 1s 78ms/step - loss: 0.9689 - val_loss: 0.3618
...
Epoch 50/50 6/6 [==============================] - 0s 24ms/step - loss: 0.0420 - val_loss: 0.0944
print("Time taken in seconds ",end-start)
Time taken in seconds 8.363724708557129
plot(history,'loss')
model_1_train_perf = model_performance_classification(model, X_train, y_train)
model_1_train_perf
12/12 [==============================] - 0s 5ms/step
|   | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.994505 | 0.994505 | 0.994553 | 0.994497 |
model_1_valid_perf = model_performance_classification(model, X_valid, y_valid)
model_1_valid_perf
3/3 [==============================] - 0s 6ms/step
|   | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.978022 | 0.978022 | 0.978022 | 0.978022 |
Let's change the optimizer to Adam, which introduces momentum as well as an adaptive learning rate.
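A sketch of the Adam update rule on the same toy objective f(w) = w² (illustrative only, not Keras internals; the learning rate here is enlarged for the toy problem, while beta1/beta2/epsilon match common defaults):

```python
import math

# Adam on f(w) = w**2: momentum (m) plus a per-parameter adaptive step (v)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    g = 2 * w                           # gradient of w**2
    m = beta1 * m + (1 - beta1) * g     # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g**2  # running mean of squared gradients (adaptivity)
    m_hat = m / (1 - beta1**t)          # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
```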
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(14,activation="relu",input_dim=X_train.shape[1]))
model.add(Dense(7,activation="relu"))
model.add(Dense(1,activation="sigmoid"))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 14) 434
dense_1 (Dense) (None, 7) 105
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 547 (2.14 KB)
Trainable params: 547 (2.14 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
model.compile(loss='binary_crossentropy', optimizer=optimizer)
start = time.time()
history = model.fit(X_train, y_train, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/50 6/6 [==============================] - 1s 49ms/step - loss: 1.2124 - val_loss: 0.5648
...
Epoch 50/50 6/6 [==============================] - 0s 12ms/step - loss: 0.1390 - val_loss: 0.0969
print("Time taken in seconds ",end-start)
Time taken in seconds 6.05787992477417
plot(history,'loss')
model_2_train_perf = model_performance_classification(model, X_train, y_train)
model_2_train_perf
12/12 [==============================] - 0s 2ms/step
|   | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.980769 | 0.980769 | 0.980813 | 0.980783 |
model_2_valid_perf = model_performance_classification(model, X_valid, y_valid)
model_2_valid_perf
3/3 [==============================] - 0s 4ms/step
|   | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.967033 | 0.967033 | 0.967465 | 0.967126 |
The difference between the training loss and the validation loss is high. Let's add dropout to regularize the model.
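During training, Keras' Dropout layer applies inverted dropout: it randomly zeroes a fraction `rate` of the activations and scales the survivors by 1/(1 - rate) so the expected activation is unchanged (at inference time the layer is a no-op). A standalone sketch of that behavior:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate):
    """Inverted dropout: zero a fraction `rate` of x, rescale the rest."""
    mask = rng.random(x.shape) >= rate   # keep each unit with probability 1 - rate
    return x * mask / (1.0 - rate)       # rescale so E[output] == input

a = np.ones(10_000)
out = dropout(a, 0.4)
# On average the activation magnitude is preserved: mean(out) is close to 1
```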
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(14,activation="relu",input_dim=X_train.shape[1]))
model.add(Dropout(0.4))
model.add(Dense(7,activation="relu"))
model.add(Dropout(0.2))
model.add(Dense(1,activation="sigmoid"))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 14) 434
dropout (Dropout) (None, 14) 0
dense_1 (Dense) (None, 7) 105
dropout_1 (Dropout) (None, 7) 0
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 547 (2.14 KB)
Trainable params: 547 (2.14 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
model.compile(loss='binary_crossentropy', optimizer=optimizer)
start = time.time()
history = model.fit(X_train, y_train, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/50 6/6 [==============================] - 1s 46ms/step - loss: 1.4634 - val_loss: 0.6042
...
Epoch 50/50 6/6 [==============================] - 0s 10ms/step - loss: 0.3236 - val_loss: 0.0986
print("Time taken in seconds ",end-start)
Time taken in seconds 6.11498498916626
plot(history,'loss')
model_3_train_perf = model_performance_classification(model, X_train, y_train)
model_3_train_perf
12/12 [==============================] - 0s 2ms/step
|   | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.975275 | 0.975275 | 0.975327 | 0.975293 |
model_3_valid_perf = model_performance_classification(model, X_valid, y_valid)
model_3_valid_perf
3/3 [==============================] - 0s 4ms/step
|   | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.967033 | 0.967033 | 0.967465 | 0.967126 |
Let's add batch normalization to see whether we can stabilize the training process and thereby improve the scores.
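During training, BatchNormalization standardizes each feature across the current batch before applying its learned scale (gamma) and shift (beta). A minimal sketch of the normalization step, with gamma and beta omitted:

```python
import numpy as np

def batch_norm(x, eps=1e-3):
    """Standardize each feature (column) over the batch (rows)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)  # eps guards against zero variance

x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
z = batch_norm(x)
# Each feature column now has zero mean and (approximately) unit variance
```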
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(14,activation="relu",input_dim=X_train.shape[1]))
model.add(BatchNormalization())
model.add(Dense(7,activation="relu"))
model.add(BatchNormalization())
model.add(Dense(1,activation="sigmoid"))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 14) 434
batch_normalization (Batch (None, 14) 56
Normalization)
dense_1 (Dense) (None, 7) 105
batch_normalization_1 (Bat (None, 7) 28
chNormalization)
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 631 (2.46 KB)
Trainable params: 589 (2.30 KB)
Non-trainable params: 42 (168.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
model.compile(loss='binary_crossentropy', optimizer=optimizer)
start = time.time()
history = model.fit(X_train, y_train, validation_data=(X_valid,y_valid) , batch_size=batch_size, epochs=epochs,class_weight=cw_dict)
end=time.time()
Epoch 1/50 6/6 [==============================] - 2s 53ms/step - loss: 1.6232 - val_loss: 0.7941
...
Epoch 50/50 6/6 [==============================] - 0s 10ms/step - loss: 0.1888 - val_loss: 0.1247
print("Time taken in seconds ",end-start)
Time taken in seconds 5.2689220905303955
plot(history,'loss')
model_4_train_perf = model_performance_classification(model, X_train, y_train)
model_4_train_perf
12/12 [==============================] - 0s 2ms/step
| | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.986264 | 0.986264 | 0.986259 | 0.986253 |
model_4_valid_perf = model_performance_classification(model, X_valid, y_valid)
model_4_valid_perf
3/3 [==============================] - 0s 4ms/step
| | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.945055 | 0.945055 | 0.948008 | 0.945473 |
Let's add both Batch Normalization and Dropout.
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(14,activation="relu",input_dim=X_train.shape[1]))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Dense(7,activation="relu"))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(1,activation="sigmoid"))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 14) 434
batch_normalization (BatchNormalization) (None, 14) 56
dropout (Dropout) (None, 14) 0
dense_1 (Dense) (None, 7) 105
batch_normalization_1 (BatchNormalization) (None, 7) 28
dropout_1 (Dropout) (None, 7) 0
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 631 (2.46 KB)
Trainable params: 589 (2.30 KB)
Non-trainable params: 42 (168.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
model.compile(loss='binary_crossentropy', optimizer=optimizer)
start = time.time()
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=epochs, class_weight=cw_dict)
end=time.time()
Epoch 1/50
6/6 [==============================] - 2s 134ms/step - loss: 2.2824 - val_loss: 0.9303
...
Epoch 25/50
6/6 [==============================] - 0s 13ms/step - loss: 0.6576 - val_loss: 0.2531
...
Epoch 50/50
6/6 [==============================] - 0s 16ms/step - loss: 0.4162 - val_loss: 0.1371
print("Time taken in seconds ",end-start)
Time taken in seconds 6.714226484298706
plot(history,'loss')
model_5_train_perf = model_performance_classification(model, X_train, y_train)
model_5_train_perf
12/12 [==============================] - 0s 4ms/step
| | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.978022 | 0.978022 | 0.978148 | 0.978054 |
model_5_valid_perf = model_performance_classification(model, X_valid, y_valid)
model_5_valid_perf
3/3 [==============================] - 0s 7ms/step
| | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.945055 | 0.945055 | 0.945604 | 0.94521 |
Let's initialize the weights using He normal. We'll also use only Dropout for regularization.
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(14,activation="relu",kernel_initializer="he_normal",input_dim=X_train.shape[1]))
model.add(Dropout(0.4))
model.add(Dense(7,activation="relu",kernel_initializer="he_normal"))
model.add(Dropout(0.2))
model.add(Dense(1,activation="sigmoid",kernel_initializer="he_normal"))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 14) 434
dropout (Dropout) (None, 14) 0
dense_1 (Dense) (None, 7) 105
dropout_1 (Dropout) (None, 7) 0
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 547 (2.14 KB)
Trainable params: 547 (2.14 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
model.compile(loss='binary_crossentropy', optimizer=optimizer)
start = time.time()
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=epochs, class_weight=cw_dict)
end=time.time()
Epoch 1/50
6/6 [==============================] - 1s 42ms/step - loss: 1.3991 - val_loss: 0.5368
...
Epoch 25/50
6/6 [==============================] - 0s 11ms/step - loss: 0.5947 - val_loss: 0.1833
...
Epoch 50/50
6/6 [==============================] - 0s 13ms/step - loss: 0.3948 - val_loss: 0.1136
print("Time taken in seconds ",end-start)
Time taken in seconds 4.6778059005737305
plot(history,'loss')
model_6_train_perf = model_performance_classification(model, X_train, y_train)
model_6_train_perf
12/12 [==============================] - 0s 3ms/step
| | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.975275 | 0.975275 | 0.975327 | 0.975293 |
model_6_valid_perf = model_performance_classification(model, X_valid, y_valid)
model_6_valid_perf
3/3 [==============================] - 0s 8ms/step
| | Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| 0 | 0.956044 | 0.956044 | 0.957476 | 0.956279 |
There's a slight improvement in the scores. The difference between train and validation scores has also reduced.
# training performance comparison
models_train_comp_df = pd.concat(
[
model_0_train_perf.T,
model_1_train_perf.T,
model_2_train_perf.T,
model_3_train_perf.T,
model_4_train_perf.T,
model_5_train_perf.T,
model_6_train_perf.T
],
axis=1,
)
models_train_comp_df.columns = [
"Neural Network (SGD, No Regularization)",
"Neural Network (SGD with Momentum, No Regularization)",
"Neural Network (Adam , No Regularization)",
"Neural Network (Adam, dropout [0.4,0.2])",
"Neural Network (Adam, Batch Normalization)",
"Neural Network (dropout [0.4,0.2], Batch Normalization)",
"Neural Network (Adam,dropout [0.4,0.2] ,He initialization)"
]
#Validation performance comparison
models_valid_comp_df = pd.concat(
[
model_0_valid_perf.T,
model_1_valid_perf.T,
model_2_valid_perf.T,
model_3_valid_perf.T,
model_4_valid_perf.T,
model_5_valid_perf.T,
model_6_valid_perf.T
],
axis=1,
)
models_valid_comp_df.columns = [
"Neural Network (SGD, No Regularization)",
"Neural Network (SGD with Momentum, No Regularization)",
"Neural Network (Adam , No Regularization)",
"Neural Network (Adam, dropout [0.4,0.2])",
"Neural Network (Adam, Batch Normalization)",
"Neural Network (dropout [0.4,0.2], Batch Normalization)",
"Neural Network (Adam,dropout [0.4,0.2] ,He initialization)"
]
models_train_comp_df
| | Neural Network (SGD, No Regularization) | Neural Network (SGD with Momentum, No Regularization) | Neural Network (Adam , No Regularization) | Neural Network (Adam, dropout [0.4,0.2]) | Neural Network (Adam, Batch Normalization) | Neural Network (dropout [0.4,0.2], Batch Normalization) | Neural Network (Adam,dropout [0.4,0.2] ,He initialization) |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.972527 | 0.994505 | 0.980769 | 0.975275 | 0.986264 | 0.978022 | 0.975275 |
| Recall | 0.972527 | 0.994505 | 0.980769 | 0.975275 | 0.986264 | 0.978022 | 0.975275 |
| Precision | 0.972669 | 0.994553 | 0.980813 | 0.975327 | 0.986259 | 0.978148 | 0.975327 |
| F1 Score | 0.972567 | 0.994497 | 0.980783 | 0.975293 | 0.986253 | 0.978054 | 0.975293 |
models_valid_comp_df
| | Neural Network (SGD, No Regularization) | Neural Network (SGD with Momentum, No Regularization) | Neural Network (Adam , No Regularization) | Neural Network (Adam, dropout [0.4,0.2]) | Neural Network (Adam, Batch Normalization) | Neural Network (dropout [0.4,0.2], Batch Normalization) | Neural Network (Adam,dropout [0.4,0.2] ,He initialization) |
|---|---|---|---|---|---|---|---|
| Accuracy | 0.956140 | 0.978022 | 0.967033 | 0.967033 | 0.945055 | 0.945055 | 0.956044 |
| Recall | 0.956140 | 0.978022 | 0.967033 | 0.967033 | 0.945055 | 0.945055 | 0.956044 |
| Precision | 0.956869 | 0.978022 | 0.967465 | 0.967465 | 0.948008 | 0.945604 | 0.957476 |
| F1 Score | 0.955776 | 0.978022 | 0.967126 | 0.967126 | 0.945473 | 0.945210 | 0.956279 |
models_train_comp_df.loc["F1 Score"] - models_valid_comp_df.loc["F1 Score"]
Neural Network (SGD, No Regularization)                        0.016792
Neural Network (SGD with Momentum, No Regularization)          0.016475
Neural Network (Adam , No Regularization)                      0.013657
Neural Network (Adam, dropout [0.4,0.2])                       0.008167
Neural Network (Adam, Batch Normalization)                     0.040780
Neural Network (dropout [0.4,0.2], Batch Normalization)        0.032844
Neural Network (Adam,dropout [0.4,0.2] ,He initialization)     0.019013
Name: F1 Score, dtype: float64
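The gap computed above generalizes into a simple selection rule: prefer the model with the highest validation F1, breaking near-ties toward the smaller train-validation gap. A minimal sketch with illustrative scores (the model names and values below are hypothetical stand-ins for the comparison tables above):

```python
import pandas as pd

# Hypothetical F1 scores standing in for the comparison tables above
train_f1 = pd.Series({"adam_dropout": 0.975, "adam_bn": 0.986, "he_dropout": 0.975})
valid_f1 = pd.Series({"adam_dropout": 0.967, "adam_bn": 0.945, "he_dropout": 0.956})

gap = train_f1 - valid_f1  # overfitting gap per model

# Highest validation F1 first; among near-ties, the smaller gap wins
ranking = pd.DataFrame({"valid_f1": valid_f1, "gap": gap}).sort_values(
    ["valid_f1", "gap"], ascending=[False, True]
)
best_model = ranking.index[0]
```

A high training score with a large gap (as with the Batch Normalization variant above) signals overfitting, so validation F1 rather than training F1 should drive the choice.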
# clears the current Keras session, resetting all layers and models previously created, freeing up memory and resources.
tf.keras.backend.clear_session()
#Initializing the neural network
model = Sequential()
model.add(Dense(14,activation="relu",kernel_initializer="he_normal",input_dim=X_train.shape[1]))
model.add(Dropout(0.4))
model.add(Dense(7,activation="relu",kernel_initializer="he_normal"))
model.add(Dropout(0.2))
model.add(Dense(1,activation="sigmoid",kernel_initializer="he_normal"))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 14) 434
dropout (Dropout) (None, 14) 0
dense_1 (Dense) (None, 7) 105
dropout_1 (Dropout) (None, 7) 0
dense_2 (Dense) (None, 1) 8
=================================================================
Total params: 547 (2.14 KB)
Trainable params: 547 (2.14 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
optimizer = tf.keras.optimizers.Adam() # defining Adam as the optimizer to be used
model.compile(loss='binary_crossentropy', optimizer=optimizer)
start = time.time()
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), batch_size=batch_size, epochs=epochs, class_weight=cw_dict)
end=time.time()
Epoch 1/50
6/6 [==============================] - 2s 134ms/step - loss: 3.0472 - val_loss: 1.0760
...
Epoch 25/50
6/6 [==============================] - 0s 38ms/step - loss: 0.8379 - val_loss: 0.2881
...
Epoch 50/50
6/6 [==============================] - 0s 13ms/step - loss: 0.4926 - val_loss: 0.1325
y_train_pred = model.predict(X_train)
y_valid_pred = model.predict(X_valid)
y_test_pred = model.predict(X_test)
12/12 [==============================] - 0s 4ms/step
3/3 [==============================] - 0s 5ms/step
4/4 [==============================] - 0s 4ms/step
print("Classification Report - Train data",end="\n\n")
cr = classification_report(y_train,y_train_pred>0.5)
print(cr)
Classification Report - Train data
precision recall f1-score support
0 0.99 0.99 0.99 228
1 0.98 0.98 0.98 136
accuracy 0.98 364
macro avg 0.98 0.98 0.98 364
weighted avg 0.98 0.98 0.98 364
print("Classification Report - Validation data",end="\n\n")
cr = classification_report(y_valid,y_valid_pred>0.5)
print(cr)
Classification Report - Validation data
precision recall f1-score support
0 0.97 1.00 0.98 57
1 1.00 0.94 0.97 34
accuracy 0.98 91
macro avg 0.98 0.97 0.98 91
weighted avg 0.98 0.98 0.98 91
print("Classification Report - Test data",end="\n\n")
cr = classification_report(y_test,y_test_pred>0.5)
print(cr)
Classification Report - Test data
precision recall f1-score support
0 0.97 1.00 0.99 72
1 1.00 0.95 0.98 42
accuracy 0.98 114
macro avg 0.99 0.98 0.98 114
weighted avg 0.98 0.98 0.98 114
The weighted F1 score on the test data is ~0.98, indicating a good balance between precision and recall, with minimal false positives and false negatives.
The model can be further tuned to improve recall on the minority (malignant) class.
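One common way to favor the minority class is to lower the decision threshold below the default 0.5 so that more borderline cases are flagged malignant. A minimal sketch of picking the threshold that meets a recall target at the best available precision (the labels and scores below are synthetic stand-ins for `y_valid` and the model's predicted probabilities):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic stand-ins for validation labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_score = np.array([0.1, 0.3, 0.2, 0.4, 0.35, 0.8, 0.9, 0.7, 0.6, 0.55])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Among thresholds reaching the target recall, keep the most precise one
target_recall = 1.0
candidates = [(p, t) for p, r, t in zip(precision[:-1], recall[:-1], thresholds)
              if r >= target_recall]
best_precision, best_threshold = max(candidates)
y_pred = (y_score >= best_threshold).astype(int)
```

With `target_recall = 1.0` the threshold drops until every malignant case is caught, trading away some precision; on the real validation split the same loop can be run with a softer target such as 0.98. The threshold should always be chosen on validation data, never on the test set.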
Which features are most important in distinguishing between benign and malignant tumors?
radius_mean, perimeter_mean, area_mean, concave points_mean, and concavity_mean have been identified as highly significant in distinguishing between benign and malignant tumors. These features consistently show strong correlations with the diagnosis and are central to the classification model.
How does the performance of different machine learning models compare in classifying breast tumors?
What is the impact of data preprocessing steps (e.g., normalization, scaling, transformation) on model performance?
How well does the model generalize to unseen data, and what techniques can be used to improve its generalizability?
What are the potential consequences of false negatives and false positives in the classification of breast tumors, and how can the model be optimized to minimize these errors?
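For the feature-importance question, a quick first screen is the correlation of each numeric feature with the binary diagnosis. A minimal sketch on synthetic stand-in data (the DataFrame, its values, and the column subset are hypothetical; on the real dataset `df` would be the loaded feature table):

```python
import pandas as pd

# Synthetic stand-in for the diagnostic dataset (values are made up)
df = pd.DataFrame({
    "radius_mean":  [12.0, 14.5, 20.1, 11.3, 19.8, 13.1],
    "texture_mean": [18.2, 24.0, 25.3, 17.9, 20.8, 19.5],
    "diagnosis":    [0, 0, 1, 0, 1, 0],  # 1 = malignant, 0 = benign
})

# Absolute correlation of each feature with the diagnosis, strongest first
corr = (df.drop(columns="diagnosis")
          .corrwith(df["diagnosis"])
          .abs()
          .sort_values(ascending=False))
```

Correlation only captures linear, one-feature-at-a-time relevance; permutation importance on the trained model is a complementary check.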
We successfully developed a machine learning model to classify breast tumors as benign or malignant. The research questions guided our approach and ensured that we addressed the key aspects of the classification task:
radius_mean, perimeter_mean, area_mean, concave points_mean, and concavity_mean emerged as the most important features. By developing a robust classification model, we have contributed to the early detection and accurate classification of breast tumors, potentially improving patient outcomes and aiding healthcare professionals in making informed decisions.